
wip: LLMs.txt toolkit local history from pr-7465 worktree #7541

Draft
Mustafa-Esoofally wants to merge 26 commits into main from wip/cleanup-llms-txt-history-20260415

Conversation

@Mustafa-Esoofally
Contributor

Summary

Preserves the granular local history of feat/llms-txt-reader-tools from the pr-7465-llms-txt-fixes worktree. PR #7458 squash-merged this work, so main already has the feature — this branch keeps the 25 original commits for reference (review iteration history, type cleanup, import fixes, etc.).

Also captures an unrelated dirty file that was in the worktree:

  • libs/agno/tests/unit/os/routers/test_sort_order_default.py — cross-contamination from another worktree, triage separately.

Status

Safe to close. PR #7458 already merged the work. This exists only so nothing gets lost during worktree cleanup.

ashpreetbedi and others added 26 commits April 10, 2026 12:44
Add a reader and toolkit for the llms.txt standard (https://llmstxt.org),
enabling agents to discover and consume documentation indexes.

LLMsTxtReader: fetches an llms.txt URL, parses the standardized markdown
format to extract all linked doc URLs, fetches page content (handling HTML,
markdown, plain text), and returns Documents with section/title metadata.
Async variant fetches all pages concurrently.
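The llms.txt format itself is simple markdown, so the parsing step can be sketched independently of agno's APIs (the function and names below are illustrative, not the reader's actual method):

```python
import re

# '- [name](url)' link entries, optionally followed by ': description'
LINK_RE = re.compile(r"-\s*\[([^\]]+)\]\(([^)]+)\)")

def parse_llms_txt(text):
    """Parse an llms.txt index: '# title', '> summary', '## section'
    headers, and '- [name](url)' link entries. Illustrative sketch only."""
    title, summary, sections = None, None, {}
    current_section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("> ") and summary is None:
            summary = line[2:].strip()
        elif line.startswith("## "):
            current_section = line[3:].strip()
            sections[current_section] = []
        elif current_section is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current_section].append((match.group(1), match.group(2)))
    return {"title": title, "summary": summary, "sections": sections}
```

Each linked URL in `sections` is then fetched and turned into a Document carrying its section and title as metadata.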

LLMsTxtTools provides two modes:
- Agentic: get_llms_txt_index returns the index so the agent picks which
  pages to read, then read_llms_txt_url fetches individual pages.
- Knowledge: read_llms_txt_and_load_knowledge bulk-fetches all linked
  pages and inserts them into a Knowledge base.

Includes 32 unit tests and 2 cookbook examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary

Addresses code review feedback on #7458. Fixes several issues in the
LLMsTxtReader and LLMsTxtTools implementation.

**Changes:**
- **Lazy BeautifulSoup import** - Deferred to `_extract_content()`
instead of hard-failing at module import time
- **Variable shadowing fix** - Renamed `url` to `entry_url` in
`async_read()` dict comprehension to avoid shadowing the method
parameter
- **Concurrency limiting** - Added `asyncio.Semaphore(10)` to prevent
overwhelming target servers when fetching 100+ URLs concurrently
- **Better text extraction** - Changed `_extract_content()` separator
from `" "` to `"\n"` to preserve document structure
- **Public API methods** - Renamed `_fetch_url` / `_parse_llms_txt` to
`fetch_url` / `parse_llms_txt` since they are called by the toolkit
- **Reader reuse** - LLMsTxtTools now creates a single `LLMsTxtReader`
instance in `__init__` instead of per tool call
- **Async tool variants** - Added `aget_llms_txt_index`,
`aread_llms_txt_url`, `aread_llms_txt_and_load_knowledge` registered via
`async_tools` following the codebase convention (e.g. BrandfetchTools)
- **New tests** - Added tests for async tool registration, reader reuse,
and newline preservation in HTML extraction
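The concurrency cap from the list above can be sketched as a plain semaphore-guarded gather (names are illustrative; agno's implementation may differ):

```python
import asyncio

async def bounded_gather(urls, fetch_one, max_concurrent=10):
    """Fan out async fetches but never run more than max_concurrent at once,
    so 100+ linked pages don't hit the target server in a single burst."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:
            return await fetch_one(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(u) for u in urls))
```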

## Type of change

- [x] Improvement

---

## Checklist

- [x] Code complies with style guidelines
- [x] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [x] Self-review completed
- [x] Documentation updated (comments, docstrings)
- [x] Tests added/updated (if applicable)

### Duplicate and AI-Generated PR Check

- [x] I have searched existing [open pull requests](../../pulls) and
confirmed that no other PR already addresses this issue
- [x] Check if this PR was entirely AI-generated (by Copilot, Claude
Code, Cursor, etc.)

---

## Additional Notes

All 36 tests pass (up from 32; added 4 new tests for async registration, reader reuse, and HTML newline preservation).

- Full async docstrings on all 3 async tool methods so the LLM sees
  proper tool descriptions in async mode
- AsyncClient now receives timeout and proxy via `_async_client_kwargs()`
- Module-level httpx import consistent with Brandfetch/Perplexity
- Extract `_process_response()` to deduplicate content-type classification
  across `fetch_url` and `async_fetch_url`
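A deduplicated classifier in the spirit of the `_process_response()` mentioned above might look like this (a sketch; the helper's real signature and return shape are not shown in this PR):

```python
def classify_content_type(content_type_header):
    """Map a Content-Type header value to a handler key so the sync and
    async fetch paths can share one classification routine."""
    media_type = content_type_header.split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type in ("text/markdown", "text/x-markdown"):
        return "markdown"
    if media_type.startswith("text/"):
        return "text"
    return "unknown"
```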
Instead of manually reading documents and looping insert(), delegate
to self.knowledge.insert(url=url, reader=self.reader), which gives us
content hashing, deduplication, status tracking, and proper vector DB
insertion — matching the pattern used by WebsiteTools and WikipediaTools.

Reader:
- Remove redundant state: in_optional and past_first_section replaced
  by a single current_section variable
- Remove dead if/else branch on proxy — httpx accepts proxy=None
- Remove WHAT comments that restate the next line
- Simplify AsyncClient construction (proxy=self.proxy directly)

Toolkit:
- Extract _format_index helper to deduplicate sync/async index building
- Delegate knowledge loading to the Knowledge.insert(url=, reader=) pipeline

Knowledge:
- Skip pre-download when a custom reader is provided — URL-based readers
  like LLMsTxtReader need the URL string, not pre-fetched BytesIO

The overview document (title + summary from the llms.txt) provides
essential context about the project. No caller ever set this to False.
Removing the parameter and its branch simplifies the reader.

- Remove __init__ docstring (no other reader has one)
- Rewrite parse_llms_txt: replace 3 continue statements with a clean
  if/elif/else chain — each line falls into one bucket
- Remove include_llms_txt_content param (always True, never exposed)

_extract_content was called exactly once. Inlining it removes one
indirection layer — the reader now has only the helpers that are
actually shared between read() and async_read().

The 3-way exception split (HTTPStatusError, RequestError, Exception)
was duplicated between sync and async. For a reader fetching doc pages,
a single catch with a warning log is sufficient. Each method is now
4 lines instead of 12.
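The consolidated error handling reduces each fetch method to roughly this shape (illustrative; `fetch` stands in for the actual HTTP call):

```python
import logging

logger = logging.getLogger(__name__)

def fetch_page(url, fetch):
    """Fetch one doc page; on any failure, log a warning and return None
    so a single bad link does not abort the whole read."""
    try:
        return fetch(url)
    except Exception as exc:
        logger.warning("Failed to fetch %s: %s", url, exc)
        return None
```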
Keep the semaphore (Codex confirms: this is external HTTP fan-out, not
local processing — an unbounded gather would burst 100 requests at once).
Remove the _MAX_CONCURRENT_FETCHES constant; inline the value with a comment
explaining why it exists.

Add timeout and follow_redirects params to the existing fetch_with_retry
and async_fetch_with_retry in utils/http.py. The reader now uses these
shared utils instead of making raw httpx.get calls — retry logic,
error handling, and connection management in one place.

Removed semaphore — httpx AsyncClient already limits concurrent
connections per host (default 20).
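A retry wrapper in the spirit of `fetch_with_retry` could look like the sketch below; the real `utils/http.py` helper differs in detail, and `fetch` plus the backoff constants here are assumptions:

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=0.1):
    """Retry a fetch with exponential backoff, re-raising the last error
    if every attempt fails."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_exc
```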
max_urls=100 was too high — it would overwhelm model context in agentic
mode. 20 matches the knowledge cookbook and WebsiteReader's max_links=10
ballpark. timeout=60 matches the global httpx client default.

bs4 import now fails at import time (matching the WebsiteReader and
WebSearchReader pattern) instead of deep inside a fetch call.

LLMsTxtReader import moved to the top of the toolkit — no reason to defer
an internal agno module.

The class docstring was a 30-line essay — most toolkits have none.
The code structure already shows the two modes (with/without knowledge).
Removed the remaining WHAT comment in _build_documents.
- Trim tool docstrings: remove repeated llms.txt explanations, keep
  only what the LLM needs to decide when/how to call the tool
- Replace the _async_client_kwargs dict builder with _async_client(),
  which returns the client directly
- Add section comments to separate helpers / agentic tools / knowledge
  tools for scannable code
- Remove unused Dict import

Docstrings now use the same format as GmailTools and GoogleCalendarTools:
triple-quote, Args (type): description, Returns: type: description.
Replaced section dividers with inline comments matching the Gmail pattern.
Helpers have no docstrings (the underscore prefix signals internal use).

Toolkit: every tool method now wrapped in try/except returning error
strings, matching Gmail/Calendar pattern. Helpers at top, tools below.

Reader: reordered — __init__, classmethods, helpers (_process_response,
_build_documents), then public methods (parse_llms_txt, fetch_url,
read, async_read). Removed bloated docstrings on helpers. Trimmed
class docstring to just the example.
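The Gmail/Calendar error-string convention described above reduces each tool method to this shape (a sketch; `do_read` is a stand-in for the actual reader call):

```python
def read_llms_txt_url_tool(url, do_read):
    """Tools return error strings instead of raising, so a failure is
    surfaced to the LLM as ordinary text it can react to."""
    try:
        return do_read(url)
    except Exception as exc:
        return f"Error reading {url}: {exc}"
```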
tools list uses Callable instead of Any. Removed Any from kwargs
(untyped kwargs is the codebase pattern — other toolkits don't type it).

Restructured from class-based to flat functions with @pytest.fixture,
matching the test_perplexity.py and test_gmail_tools.py patterns.
matching test_perplexity.py and test_gmail_tools.py patterns.

New coverage:
- Async reader: async_read happy path + failure
- Async toolkit: aget_llms_txt_index, aread_llms_txt_url,
  aread_llms_txt_and_load_knowledge
- Error handling: try/except returns error strings
- Edge cases: empty overview, HTML sniffing, unknown content-type
- Shared _mock_httpx_response helper for DRY mock setup

34 tests -> 46 tests

The previous fix (skip pre-download when any custom reader is provided)
broke PDFReader and other file-based readers that need BytesIO. Now we
check if the reader supports ContentType.URL — only URL-based readers
like LLMsTxtReader and WebsiteReader skip the pre-download. File-based
readers (PDFReader, CSVReader, etc.) still get pre-downloaded bytes.
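The reader-capability check might be sketched like this; `ContentType` and the capability-set parameter are stand-ins for the real agno types named in the commit:

```python
from enum import Enum

class ContentType(Enum):
    # stand-in for agno's ContentType enum
    URL = "url"
    FILE = "file"

def needs_predownload(reader_content_types):
    """File-based readers (PDF, CSV, ...) need pre-fetched bytes;
    URL-based readers (LLMsTxt, Website) consume the URL string directly."""
    return ContentType.URL not in reader_content_types
```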
Only forward timeout and follow_redirects to httpx when explicitly
passed by the caller. Previously, default values (timeout=None,
follow_redirects=False) were always forwarded, which removed httpx's
built-in 5s timeout and overrode client-level redirect settings.
follow_redirects and timeout use Optional[None] default so existing
callers see zero behavior change. Build kwargs dict conditionally
instead of type-ignore comments. Import order fixed by format.sh.
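The conditional-forwarding fix amounts to building the kwargs dict only from explicitly passed values (a sketch, decoupled from httpx):

```python
def build_http_kwargs(timeout=None, follow_redirects=None):
    """Only include options the caller actually set, so unset options
    fall through to the HTTP client's own defaults (e.g. httpx's
    built-in 5s timeout and client-level redirect setting)."""
    kwargs = {}
    if timeout is not None:
        kwargs["timeout"] = timeout
    if follow_redirects is not None:
        kwargs["follow_redirects"] = follow_redirects
    return kwargs
```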